AITopics | attack objective

Collaborating Authors

attack objective

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Adversarial observations in probabilistic State-Space Models for robust Reinforcement Learning

Santos-Pascual, M., Insua, D. Ríos

arXiv.org Machine LearningJun-23-2026

Machine learning (ML) systems increasingly support decision-making in high-stakes settings such as robotics, autonomous systems, finance, homeland security, and critical infrastructure protection. In these domains, robustness and reliability are essential because failures can translate into physical harm, financial loss, or operational breakdown (García and Fernández, 2015). A recurring weakness is that many ML pipelines implicitly assume that training and deployment data are independent and identically distributed (i.i.d.), even though real deployments often violate this assumption through sensor drift, changing environments, and distribution shift (Quiñonero-Candela et al., 2009). In security-relevant contexts, this problem is amplified because adversaries can deliberately manipulate observations, rewards, or the environment to induce targeted shifts and drive the system toward failure (Barreno et al., 2006; Biggio and Roli, 2018; Vassilev et al., 2024). These concerns motivate the relatively recent field of adversarial machine learning (AML), which studies how malicious perturbations can break learning systems and how to design defenses against them (Biggio and Roli, 2018; Goodfellow, Shlens and Szegedy, 2015).

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2606.2088

Country:

North America > United States (0.28)
Europe > United Kingdom > England (0.28)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (0.93)
Government (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
(2 more...)

Add feedback

c1f0b856a35986348ab3414177266f75-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 22:29:19 GMT

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
(2 more...)

Genre: Research Report (0.46)

Industry:

Information Technology > Security & Privacy (0.93)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

neurips_attack_recsys

haoyang

Neural Information Processing SystemsFeb-11-2026, 18:06:19 GMT

attacker, target item, user behavior, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

Towards Reasonable Budget Allocation in Untargeted Graph Structure Attacks via Gradient Debias

Neural Information Processing SystemsDec-25-2025, 01:03:54 GMT

It has become cognitive inertia to employ cross-entropy loss function in classification related tasks. In the untargeted attacks on graph structure, the gradients derived from the attack objective are the attacker's basis for evaluating a perturbation scheme. Previous methods use negative cross-entropy loss as the attack objective in attacking node-level classification models. However, the suitability of the cross-entropy function for constructing the untargeted attack objective has yet been discussed in previous works. This paper argues about the previous unreasonable attack objective from the perspective of budget allocation. We demonstrate theoretically and empirically that negative cross-entropy tends to produce more significant gradients from nodes with lower confidence in the labeled classes, even if the predicted classes of these nodes have been misled. To free up these inefficient attack budgets, we propose a simple attack model for untargeted attacks on graph structure based on a novel attack objective which generates unweighted gradients on graph structures that are not affected by the node confidence. By conducting experiments in gray-box poisoning attack scenarios, we demonstrate that a reasonable budget allocation can significantly improve the effectiveness of gradient-based edge perturbations without any extra hyper-parameter.

attack objective, reasonable budget allocation, untargeted graph structure attack, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation

Sultana, Najrin, Rashid, Md Rafi Ur, Gu, Kang, Mehnaz, Shagufta

arXiv.org Artificial IntelligenceNov-6-2025

LLMs can provide substantial zero-shot performance on diverse tasks using a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack frameworks designed to systematically generate dynamic and adaptive adversarial examples by leveraging the understanding of the LLMs. We produce subtle and natural-looking adversarial inputs that preserve semantic similarity to the original text while effectively deceiving the target LLM. By utilizing an automated, LLM-driven pipeline, we eliminate the dependence on external heuristics. Our attacks evolve with the advancements in LLMs and demonstrate strong transferability across models unknown to the attacker. Overall, this work provides a systematic approach for the self-assessment of an LLM's robustness. We release our code and data at https://github.com/Shukti042/AdversarialExample.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.03128

Country:

Europe (0.93)
North America > United States (0.68)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

Add feedback

c1f0b856a35986348ab3414177266f75-Paper-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 06:37:32 GMT

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
(2 more...)

Genre: Research Report (0.46)

Industry:

Information Technology > Security & Privacy (0.93)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation

Geng, Runpeng, Wang, Yanting, Chen, Ying, Jia, Jinyuan

arXiv.org Artificial IntelligenceAug-27-2025

Retrieval-augmented generation (RAG) systems are widely deployed in real-world applications in diverse domains such as finance, healthcare, and cybersecurity. However, many studies showed that they are vulnerable to knowledge corruption attacks, where an attacker can inject adversarial texts into the knowledge database of a RAG system to induce the LLM to generate attacker-desired outputs. Existing studies mainly focus on attacking specific queries or queries with similar topics (or keywords). In this work, we propose UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike prior work, UniC-RAG jointly optimizes a small number of adversarial texts that can simultaneously attack a large number of user queries with diverse topics and domains, enabling an attacker to achieve various malicious objectives, such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. We formulate UniC-RAG as an optimization problem and further design an effective solution to solve it, including a balanced similarity-based clustering method to enhance the attack's effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly effective and significantly outperforms baselines. For instance, UniC-RAG could achieve over 90% attack success rate by injecting 100 adversarial texts into a knowledge database with millions of texts to simultaneously attack a large set of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and show that they are insufficient to defend against UniC-RAG, highlighting the need for new defense mechanisms in RAG systems.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.18652

Genre: Research Report > New Finding (0.89)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

neurips_attack_recsys

haoyang

Neural Information Processing SystemsAug-18-2025, 15:15:32 GMT

artificial intelligence, machine learning, user behavior, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

neurips_attack_recsys

haoyang

Neural Information Processing SystemsAug-18-2025, 15:15:28 GMT

artificial intelligence, attacker, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
(2 more...)

Add feedback

A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks

Bullwinkel, Blake, Russinovich, Mark, Salem, Ahmed, Zanella-Beguelin, Santiago, Jones, Daniel, Severi, Giorgio, Kim, Eugenia, Hines, Keegan, Minnich, Amanda, Zunger, Yonatan, Kumar, Ram Shankar Siva

arXiv.org Artificial IntelligenceJul-8-2025

Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.02956

Country: